Based on 2022 Singapore HDB resale price (real-life) data sets, your team is supposed to construct a multiple regression model (for one particular district) to explain the HDB resale price (ResalePrice) in dollars with the given independent variables.
Documentation and Presentation: 10 marks
Methodology: 10 marks
R-codes, computer outputs interpretation and graphical explanations: 15 marks
Recommendations and conclusions: 15 marks
Written Report: PDF Format: Within 10pages excluding the cover page and Appendix
Appendix: codes with computer outputs
You are required to provide the detailed documentation of how you search your recommended model for inference purpose and justify each step in your data analysis. You are also expected to provide model assumption justification and hypothesis testing evidences (R-codes and computer outputs) with clear explanations that your recommended model is the best model among all the models considered according to BIC criterion. Based on your final recommended model, state clearly your recommendations and conclusions.
Load Data In
Sengkang1 <- read.csv("data/Sengkang2023P.csv", stringsAsFactors = TRUE)
head(Sengkang1, 5)
## Date Type Block Street Story Area Model LeaseBegin LeaseRemain
## 1 2022-01 3 ROOM 331C ANCHORVALE ST 10 TO 12 67 Model A 2015 92 years 09 months
## 2 2022-01 3 ROOM 209C COMPASSVALE LANE 16 TO 18 67 Model A 2011 88 years 11 months
## 3 2022-01 3 ROOM 211C COMPASSVALE LANE 10 TO 12 68 Model A 2013 90 years 02 months
## 4 2022-01 3 ROOM 467B FERNVALE LINK 04 TO 06 68 Model A 2016 93 years 08 months
## 5 2022-01 3 ROOM 414B FERNVALE LINK 07 TO 09 68 Model A 2016 93 years 01 month
## ResalePrice
## 1 420000
## 2 418000
## 3 410000
## 4 382000
## 5 410000
library(httr)
geocode <- function(block, streetname) {
base_url <- "https://developers.onemap.sg/commonapi/search"
address <- paste(block, streetname, sep = " ")
query <- list("searchVal" = address,
"returnGeom" = "Y",
"getAddrDetails" = "N",
"pageNum" = "1")
res <- GET(base_url, query = query)
restext<-content(res, as="text")
output <- jsonlite::fromJSON(restext) %>%
as.data.frame() %>%
dplyr::select("results.LATITUDE", "results.LONGITUDE")
return(output)
}
Sengkang$LATITUDE <- 0
Sengkang$LONGITUDE <- 0
for (i in 1:nrow(Sengkang)){
temp_output <- geocode(Sengkang[i, 3], Sengkang[i, 4])
Sengkang$LATITUDE[i] <- temp_output$results.LATITUDE
Sengkang$LONGITUDE[i] <- temp_output$results.LONGITUDE
}
write_rds(Sengkang, file ="Sengkang_coords")
Sengkang <- read_rds("Sengkang_coords")
Sengkang_sf <- st_as_sf(Sengkang, coords = c("LONGITUDE", "LATITUDE"), crs = 4326) |>
st_transform(crs = 3414)
glimpse(Sengkang_sf)
## Rows: 2,116
## Columns: 11
## $ Date <chr> "2022-01", "2022-01", "2022-01", "2022-01", "2022-01", "2022-01", "2022-01", "…
## $ Type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM"…
## $ Block <chr> "331C", "209C", "211C", "467B", "414B", "467A", "414B", "414B", "453A", "472B"…
## $ Street <chr> "ANCHORVALE ST", "COMPASSVALE LANE", "COMPASSVALE LANE", "FERNVALE LINK", "FER…
## $ Story <chr> "10 TO 12", "16 TO 18", "10 TO 12", "04 TO 06", "07 TO 09", "19 TO 21", "16 TO…
## $ Area <int> 67, 67, 68, 68, 68, 68, 68, 68, 67, 68, 67, 67, 68, 67, 67, 68, 68, 68, 67, 68…
## $ Model <chr> "Model A", "Model A", "Model A", "Model A", "Model A", "Model A", "Model A", "…
## $ LeaseBegin <int> 2015, 2011, 2013, 2016, 2016, 2016, 2016, 2016, 2015, 2016, 2015, 2015, 2017, …
## $ LeaseRemain <chr> "92 years 09 months", "88 years 11 months", "90 years 02 months", "93 years 08…
## $ ResalePrice <int> 420000, 418000, 410000, 382000, 410000, 408000, 412000, 400000, 405000, 350000…
## $ geometry <POINT [m]> POINT (34282.7 41941.51), POINT (35328.82 40660.61), POINT (35341.91 407…
library(tmap)
sengkang_sp <- Sengkang_sf %>%
dplyr::select(3:11) %>%
as_Spatial()
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(sengkang_sp) +
tm_dots()
From the regression above, we know that there are too many categorical variables to consider. We have to condense them accordingly.
length(unique(Sengkang$Block))
## [1] 521
length(unique(Sengkang$Street))
## [1] 29
length(unique(Sengkang$LeaseRemain))
## [1] 226
We can see from the code chunk above that block will have 521, street has 29 and LeaseRemain has 226 categorical variables
We will use the lubridate package to adjust the year and calculate lease years used as a numeric rather than a categorical variable. Since years_used and LeaseRemain are perfected correlated, we will drop LeaseRemaind from the dataframe.
site, the HDB blocks are numbered by 100+, 200+, 300+ and 400+ in Rivervale, Compassvale, Anchorvale and Fernvale respectively.
df <- Sengkang1 %>%
mutate(Date = lubridate::ym(Date),
LeaseBegin = lubridate::ym( paste0(LeaseBegin,"-01")),
years_used = as.numeric((Date - LeaseBegin)/365),
subzone = ifelse(grepl("^1", Block), "Rivervale",
ifelse(grepl("^2", Block), "Compassvale",
ifelse(grepl("^3", Block), "Anchorvale",
ifelse(grepl("4", Block), "Fernvale", "others")))),
.before = Street) %>%
mutate(subzone= as.factor(subzone),
Date = as.factor(Date)) %>%
select(-LeaseRemain, -LeaseBegin)
Using leaseremain as a factor will generate too many binary variables. Convert them into years_used would be easier. Date and LeaseBegin variables must be in date type before substracting between the two. The output would be in (drtn) days and thus we have to set it to numeric set to years.
plot(df$Date, df$ResalePrice)
plot(df$Type, df$ResalePrice)
plot(df$Block, df$ResalePrice)
plot(df$years_used, df$ResalePrice)
plot(df$subzone, df$ResalePrice)
plot(df$Street, df$ResalePrice)
plot(df$Story, df$ResalePrice)
plot(df$Area, log(df$ResalePrice))
plot(df$Model, df$ResalePrice)
We will first calculate the BIC of the regression of all variables.
reg_all <- lm(ResalePrice ~ ., data = df)
BIC(reg_all)
## [1] 52085.02
r <- residuals(reg_all)
plot(df$Date, r,
xlab = "Date", ylab = "Residuals")
plot(df$Type, r,
xlab = "Type", ylab = "Residuals")
plot(df$Block, r,
xlab = "Block", ylab = "Residuals")
plot(df$years_used, r,
xlab = "years_used", ylab = "Residuals")
plot(df$subzone, r,
xlab = "subzone", ylab = "Residuals")
plot(df$Street, r,
xlab = "Street", ylab = "Residuals")
plot(df$Story, r,
xlab = "Story", ylab = "Residuals")
plot(df$Area, r,
xlab = "Area", ylab = "Residuals")
plot(df$Model, r,
xlab = "Model", ylab = "Residuals")
From the residual plots above, we notice all plots points are scattered
around the residual =0. This suggesrt that the model assumptions are not
violated. We have three types of location columns now, Block, Street and
subzone. There should be high correlation between the X variables for
these 3 variables, we will test the BIC number by dropping each variable
out and picking the model with the lowest BIC ## Removing Block
L1 <- lm(ResalePrice ~ .-Block, data = df)
L2 <- lm(ResalePrice ~ .-Street,
data = df)
L3 <- lm(ResalePrice ~ .-subzone,
data = df)
It would seem removing Block would be the best. Now we will compare between street and subzone ## Removing Block and Street
L4 <- lm(ResalePrice ~ .-Block-Street,
data = df)
L5 <- lm(ResalePrice ~ .-Block-subzone, data = df)
BIC_location <- data.frame(lm = c(".-Block",".-Street",".-subzone",".-Block-Street",".-Block-subzone"),
BIC = c(BIC(L1),BIC(L2),BIC(L3),BIC(L4),BIC(L5)))
BIC_location
## lm BIC
## 1 .-Block 49895.90
## 2 .-Street 52085.02
## 3 .-subzone 52085.02
## 4 .-Block-Street 50966.23
## 5 .-Block-subzone 49941.80
plot(L1)
Note: We will be using L1 as the main regression and adding on to that regression. We know that the model is linear as checked under residual plots assumptions.
L6 <- lm(ResalePrice ~ .-Block,
data = df)
summary(L6)
##
## Call:
## lm(formula = ResalePrice ~ . - Block, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125199 -18458 -391 16605 127493
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210695.6 22823.2 9.232 < 2e-16 ***
## Date2022-02-01 3068.6 3127.6 0.981 0.326637
## Date2022-03-01 9561.5 3012.4 3.174 0.001525 **
## Date2022-04-01 19707.9 2957.8 6.663 3.43e-11 ***
## Date2022-05-01 22649.6 3022.6 7.493 9.91e-14 ***
## Date2022-06-01 30197.7 3101.4 9.737 < 2e-16 ***
## Date2022-07-01 34030.9 2899.1 11.738 < 2e-16 ***
## Date2022-08-01 35921.4 2959.2 12.139 < 2e-16 ***
## Date2022-09-01 46776.8 2968.3 15.759 < 2e-16 ***
## Date2022-10-01 54208.3 3198.9 16.946 < 2e-16 ***
## Date2022-11-01 56573.0 3172.8 17.831 < 2e-16 ***
## Date2022-12-01 63968.7 3084.1 20.742 < 2e-16 ***
## Type4 ROOM 49589.9 8541.6 5.806 7.41e-09 ***
## Type5 ROOM 111718.7 14907.3 7.494 9.85e-14 ***
## years_used -7932.9 170.6 -46.503 < 2e-16 ***
## subzoneCompassvale 41771.1 8999.2 4.642 3.67e-06 ***
## subzoneFernvale -19184.9 4668.2 -4.110 4.12e-05 ***
## subzoneRivervale -34282.0 8746.1 -3.920 9.16e-05 ***
## StreetANCHORVALE DR 31753.6 6185.3 5.134 3.11e-07 ***
## StreetANCHORVALE LANE -12441.5 7709.0 -1.614 0.106706
## StreetANCHORVALE LINK 12221.1 4386.9 2.786 0.005388 **
## StreetANCHORVALE RD -361.1 4641.1 -0.078 0.937990
## StreetANCHORVALE ST 4879.3 5521.0 0.884 0.376923
## StreetCOMPASSVALE BOW 49697.4 10253.2 4.847 1.35e-06 ***
## StreetCOMPASSVALE CRES -35722.9 9568.3 -3.733 0.000194 ***
## StreetCOMPASSVALE DR 31509.0 9913.0 3.179 0.001502 **
## StreetCOMPASSVALE LANE -31117.6 9936.8 -3.132 0.001763 **
## StreetCOMPASSVALE LINK 65221.1 10239.0 6.370 2.33e-10 ***
## StreetCOMPASSVALE RD 5531.1 10588.8 0.522 0.601478
## StreetCOMPASSVALE ST -34202.2 10359.9 -3.301 0.000978 ***
## StreetCOMPASSVALE WALK 2385.9 10932.0 0.218 0.827260
## StreetFERNVALE LANE 11616.4 7948.8 1.461 0.144058
## StreetFERNVALE LINK 8008.5 4145.7 1.932 0.053525 .
## StreetFERNVALE RD 19886.0 4575.1 4.347 1.45e-05 ***
## StreetFERNVALE ST 6151.8 6283.4 0.979 0.327671
## StreetJLN KAYU 6159.1 9980.8 0.617 0.537240
## StreetRIVERVALE CRES 10162.6 9429.7 1.078 0.281284
## StreetRIVERVALE DR 46191.2 9828.5 4.700 2.78e-06 ***
## StreetRIVERVALE ST 48896.7 11317.6 4.320 1.63e-05 ***
## StreetRIVERVALE WALK 73305.2 11396.4 6.432 1.56e-10 ***
## StreetSENGKANG CTRL 70136.4 10972.0 6.392 2.02e-10 ***
## StreetSENGKANG EAST AVE 66487.3 10535.5 6.311 3.39e-10 ***
## StreetSENGKANG EAST RD 1205.1 13776.8 0.087 0.930305
## StreetSENGKANG EAST WAY 46129.6 6596.5 6.993 3.62e-12 ***
## StreetSENGKANG WEST AVE 31509.1 9286.4 3.393 0.000704 ***
## StreetSENGKANG WEST WAY NA NA NA NA
## Story04 TO 06 25592.7 2314.5 11.058 < 2e-16 ***
## Story07 TO 09 42931.7 2287.7 18.767 < 2e-16 ***
## Story10 TO 12 48919.7 2365.3 20.682 < 2e-16 ***
## Story13 TO 15 57804.1 2288.3 25.261 < 2e-16 ***
## Story16 TO 18 66014.4 2852.9 23.139 < 2e-16 ***
## Story19 TO 21 62481.2 4257.8 14.674 < 2e-16 ***
## Story22 TO 24 70526.8 4802.6 14.685 < 2e-16 ***
## Story25 TO 27 64799.3 6652.8 9.740 < 2e-16 ***
## Area 2752.4 327.1 8.413 < 2e-16 ***
## ModelModel A 10424.3 2968.2 3.512 0.000454 ***
## ModelModel A2 52537.4 5374.8 9.775 < 2e-16 ***
## ModelPremium Apartment 20248.3 2601.4 7.784 1.11e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29140 on 2059 degrees of freedom
## Multiple R-squared: 0.8887, Adjusted R-squared: 0.8857
## F-statistic: 293.6 on 56 and 2059 DF, p-value: < 2.2e-16
BIC(L6)
## [1] 49895.9
library(fitdistrplus)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: survival
fnorm <- fitdist(residuals(L1), "norm")
## Warning in sqrt(diag(varcovar)): NaNs produced
## Warning in sqrt(1/diag(V)): NaNs produced
## Warning in cov2cor(varcovar): diag(.) had 0 or NA entries; non-finite result is doubtful
result <- gofstat(fnorm, discrete = FALSE)
result
## Goodness-of-fit statistics
## 1-mle-norm
## Kolmogorov-Smirnov statistic 0.03274423
## Cramer-von Mises statistic 0.67139103
## Anderson-Darling statistic 4.71427005
##
## Goodness-of-fit criteria
## 1-mle-norm
## Akaike's Information Criterion 49455.78
## Bayesian Information Criterion 49467.09
KScritvalue <-1.36/sqrt(length(df$Date))
KScritvalue
## [1] 0.02956522
summary(fnorm)
## Fitting of the distribution ' norm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## mean -1.540467e-12 NaN
## sd 2.874631e+04 102.2332
## Loglikelihood: -24725.89 AIC: 49455.78 BIC: 49467.09
## Correlation matrix:
## mean sd
## mean 1 NaN
## sd NaN 1
plot(fnorm)
Since KS statistics = 0.032744 > 0.02956522=Kcrit, we can reject the null hypothesis and thus the data do provide sufficient evidence to show the normal model is the appropriate model.